121 research outputs found

    Testing Interestingness Measures in Practice: A Large-Scale Analysis of Buying Patterns

    Understanding customer buying patterns is of great interest to the retail industry and has been shown to benefit a wide variety of goals, ranging from managing stocks to implementing loyalty programs. Association rule mining is a common technique for extracting correlations such as "people in the South of France buy rosé wine" or "customers who buy pâté also buy salted butter and sour bread." Unfortunately, sifting through a high number of buying patterns is not useful in practice, because of the predominance of popular products in the top rules. As a result, a number of "interestingness" measures (over 30) have been proposed to rank rules. However, there is no agreement on which measures are more appropriate for retail data. Moreover, since pattern mining algorithms output thousands of association rules for each product, the ability of an analyst to rely on ranking measures to identify the most interesting ones is crucial. In this paper, we develop CAPA (Comparative Analysis of PAtterns), a framework that provides analysts with the ability to compare the outcome of interestingness measures applied to buying patterns in the retail industry. We report on how we used CAPA to compare 34 measures applied to over 1,800 stores of Intermarché, one of the largest food retailers in France.
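    The abstract above notes that popular products dominate the top rules, which is why ranking measures matter. The following is a minimal sketch of three textbook measures (support, confidence, lift) on hypothetical toy receipts. The function names and the data are illustrative assumptions, and the abstract does not say which of these belong to the 34 measures compared with CAPA.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, antecedent | consequent) / base


def lift(transactions: List[Transaction],
         antecedent: FrozenSet[str],
         consequent: FrozenSet[str]) -> float:
    """Confidence divided by the consequent's support (1.0 means independence)."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical toy receipts: bread is the "popular product" here.
TOY = [frozenset(t) for t in (
    {"bread", "pate", "butter"},
    {"bread", "pate", "butter"},
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
)]

PATE = frozenset({"pate"})
BREAD = frozenset({"bread"})
BUTTER = frozenset({"butter"})

if __name__ == "__main__":
    for name, rhs in (("bread", BREAD), ("butter", BUTTER)):
        print(f"pate -> {name}: "
              f"confidence={confidence(TOY, PATE, rhs):.2f}, "
              f"lift={lift(TOY, PATE, rhs):.2f}")
```

    On this toy data both rules have the same confidence, so confidence alone cannot separate the rule pointing at the ubiquitous product from the more specific one, while lift ranks "pate -> butter" higher. Different measures can therefore order the same rules differently, which is the kind of disagreement a comparison framework has to quantify.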

    Fouille et classement d'ensembles fermés dans des données transactionnelles de grande échelle.

    The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a "long tail" distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, with fewer than a hundred associations at once. To address the items' coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items' associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.

    The effect of enteral and parenteral feeding on secretion of orexigenic peptides in infants

    Background: Feeding in the first months of life seems to influence the risk of obesity and the predisposition to some diseases, including atherosclerosis. The mechanisms of these relations are unknown; however, modification of hormonal action is a likely candidate. Therefore, in this study the levels of ghrelin and orexin A, peripheral and central peptides of the orexigenic gut-brain axis, were determined. Methods: Plasma concentrations of ghrelin and orexin were measured fasting and one hour after a meal in age-matched infants (mean age 4 months) who were breast-fed (group I; n = 17), milk formula-fed (group II; n = 16), or fed a highly hydrolyzed, hypoallergenic formula (group III; n = 14), as well as in children receiving intravenous nutrients (glucose: group IV, n = 15; total parenteral nutrition: group V, n = 14). Peptides were determined using commercial EIA kits. Results: Despite similar caloric intake in the orally fed children, fasting ghrelin and orexin levels were significantly lower in the breast-fed children (0.37 ± 0.17 and 1.24 ± 0.29 ng/ml, respectively) than in the remaining groups (0.5 ± 0.27 and 1.64 ± 0.52 ng/ml, respectively, in group II and 0.77 ± 0.27 and 2.04 ± 1.1 ng/ml, respectively, in group III). Postprandial ghrelin concentrations increased to 0.87 ± 0.29 ng/ml (p < 0.002) and 0.76 ± 0.26 ng/ml (p < 0.01) in groups I and II, respectively, compared with fasting values. A decrease in ghrelin concentration after the meal was observed only in group III (0.47 ± 0.24 ng/ml). Feeding did not influence the orexin concentration. In groups IV and V, the ghrelin and orexin levels resembled those in milk formula-fed children. Conclusion: The highly hydrolyzed diet strongly affects fasting and postprandial ghrelin and orexin plasma concentrations, with a possible negative effect on short- and long-term development. Total parenteral nutrition, with its continuous stimulation and lack of fasting/postprandial modulation, might also be responsible for disturbed development in children fed this way.

    Methane emission by Camelids

    Methane emissions from ruminant livestock have been intensively studied in order to reduce their contribution to the greenhouse effect. Ruminants were found to produce more enteric methane than other mammalian herbivores. As camelids share some features of their digestive anatomy and physiology with ruminants, it has been proposed that they produce similar amounts of methane per unit of body mass. This is of special relevance for countrywide greenhouse gas budgets of countries that harbor large populations of camelids, like Australia. However, hardly any quantitative methane emission measurements have been performed in camelids. In order to fill this gap, we carried out respiration chamber measurements with three camelid species (Vicugna pacos, Lama glama, Camelus bactrianus; n = 16 in total), all kept on a diet consisting of food produced from alfalfa only. The camelids produced less methane expressed on the basis of body mass (0.32 ± 0.11 L kg⁻¹ d⁻¹) when compared to literature data on domestic ruminants fed on roughage diets (0.58 ± 0.16 L kg⁻¹ d⁻¹). However, there was no significant difference between the two suborders when methane emission was expressed on the basis of digestible neutral detergent fiber intake (92.7 ± 33.9 L kg⁻¹ in camelids vs. 86.2 ± 12.1 L kg⁻¹ in ruminants). This implies that the pathways of methanogenesis forming part of the microbial digestion of fiber in the foregut are similar between the groups, and that the lower methane emission of camelids can be explained by their generally lower relative food intake. Our results suggest that the methane emission of Australia's feral camels corresponds to only 1 to 2% of the methane amount produced by the country's domestic ruminants, and that calculations of greenhouse gas budgets of countries with large camelid populations based on equations developed for ruminants generally overestimate the actual levels.

    Invited review: Large-scale indirect measurements for enteric methane emissions in dairy cattle: A review of proxies and their potential for use in management and breeding decisions

    Publication history: Accepted - 7 December 2016; Published online - 1 February 2017. Efforts to reduce the carbon footprint of milk production through selection and management of low-emitting cows require accurate and large-scale measurements of methane (CH4) emissions from individual cows. Several techniques have been developed to measure CH4 in a research setting but most are not suitable for large-scale recording on farm. Several groups have explored proxies (i.e., indicators or indirect traits) for CH4; ideally these should be accurate, inexpensive, and amenable to being recorded individually on a large scale. This review (1) systematically describes the biological basis of current potential CH4 proxies for dairy cattle; (2) assesses the accuracy and predictive power of single proxies and determines the added value of combining proxies; (3) provides a critical evaluation of the relative merit of the main proxies in terms of their simplicity, cost, accuracy, invasiveness, and throughput; and (4) discusses their suitability as selection traits. The proxies range from simple and low-cost measurements such as body weight and high-throughput milk mid-infrared spectroscopy (MIR) to more challenging measures such as rumen morphology, rumen metabolites, or microbiome profiling. Proxies based on rumen samples are generally poor to moderately accurate predictors of CH4, and are costly and difficult to measure routinely on farm. Proxies related to body weight or milk yield and composition, on the other hand, are relatively simple, inexpensive, and high throughput, and are easier to implement in practice. In particular, milk MIR, along with covariates such as lactation stage, is a promising option for prediction of CH4 emission in dairy cows. No single proxy was found to accurately predict CH4, and combinations of 2 or more proxies are likely to be a better solution.
Combining proxies can increase the accuracy of predictions by 15 to 35%, mainly because different proxies describe independent sources of variation in CH4 and one proxy can correct for shortcomings in the other(s). The most important applications of CH4 proxies are in dairy cattle management and breeding for lower environmental impact. When breeding for traits of lower environmental impact, single or multiple proxies can be used as indirect criteria for the breeding objective, but care should be taken to avoid unfavorable correlated responses. Finally, although combinations of proxies appear to provide the most accurate estimates of CH4, the greatest limitation today is the lack of robustness in their general applicability. Future efforts should therefore be directed toward developing combinations of proxies that are robust and applicable across diverse production systems and environments. Technical and financial support from the COST Action FA1302 of the European Union.

    Neuroendocrine control of satiation


    Mining and ranking closed itemsets from large-scale transactional datasets

    The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a "long tail" distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, with fewer than a hundred associations at once. To address the items' coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items' associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.

    TopPI: An efficient algorithm for item-centric mining

    In this paper, we introduce item-centric mining, a new semantics for mining long-tailed datasets. Our algorithm, TopPI, finds for each item its top-k most frequent closed itemsets. While most mining algorithms focus on the globally most frequent itemsets, TopPI guarantees that each item is represented in the results, regardless of its frequency in the database. TopPI allows users to efficiently explore Web data, answering questions such as "what are the k most common sets of songs downloaded together with the ones of my favorite artist?". When processing retail data consisting of 55 million supermarket receipts, TopPI finds the itemset "milk, puff pastry" that appears 10,315 times, but also "frangipane, puff pastry" and "nori seaweed, wasabi, sushi rice" that occur only 1,120 and 163 times, respectively. Our experiments with analysts from the marketing department of our retail partner demonstrate that item-centric mining discovers valuable itemsets. We also show that TopPI can serve as a building block to approximate complex itemset ranking measures such as the p-value. Thanks to efficient enumeration and pruning strategies, TopPI avoids the search space explosion induced by mining low-support itemsets. We show how TopPI can be parallelized on multi-core machines and distributed on Hadoop clusters. Our experiments on datasets with different characteristics show the superiority of TopPI when compared to standard top-k solutions, and to Parallel FP-Growth, its closest competitor.
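    To make the item-centric semantics concrete, here is a brute-force sketch of the problem definition only: for each item, list the k most frequent closed itemsets that contain it. It is not the TopPI algorithm, which relies on enumeration and pruning strategies that the abstract does not detail. The function names, the tie-breaking rule, the choice to count single-item closed itemsets, and the toy receipts with their counts are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def closed_itemsets(transactions: List[Itemset]) -> Dict[Itemset, int]:
    """Map every closed itemset to its support count.

    Brute force: the closure of a candidate is the intersection of all
    transactions containing it, and that closure has the same support.
    Exponential in the number of distinct items, so toy data only.
    """
    items = sorted(set().union(*transactions))
    closed: Dict[Itemset, int] = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            covering = [t for t in transactions if cand <= t]
            if covering:
                closed[frozenset.intersection(*covering)] = len(covering)
    return closed


def top_k_per_item(transactions: List[Itemset],
                   k: int) -> Dict[str, List[Tuple[Itemset, int]]]:
    """For each item, the k most frequent closed itemsets containing it."""
    closed = closed_itemsets(transactions)
    result: Dict[str, List[Tuple[Itemset, int]]] = {}
    for item in sorted(set().union(*transactions)):
        containing = [(s, n) for s, n in closed.items() if item in s]
        # Highest support first; ties broken alphabetically (an assumption).
        containing.sort(key=lambda pair: (-pair[1], sorted(pair[0])))
        result[item] = containing[:k]
    return result


# Hypothetical receipts loosely echoing the itemsets quoted in the abstract.
RECEIPTS = [frozenset(r) for r in (
    {"milk", "puff pastry"},
    {"milk", "puff pastry"},
    {"milk", "puff pastry"},
    {"milk"},
    {"frangipane", "puff pastry"},
    {"nori", "wasabi", "sushi rice"},
)]

if __name__ == "__main__":
    for item, ranked in top_k_per_item(RECEIPTS, k=2).items():
        print(item, [(sorted(s), n) for s, n in ranked])
```

    In this toy run, a global top-2 by support would return only "milk" and "puff pastry", whereas the per-item view still reports "frangipane, puff pastry" and "nori, sushi rice, wasabi" for their rare items. That is the coverage guarantee the abstract describes.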
